Average Reward Optimization Objective In Partially Observable Domains
Abstract
We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs) (Littman et al., 2001). The key to average-reward computation is that the system has a well-defined stationary behavior, so that the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes possible. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of the policy parameters. Based on this result, we define a related reward process that is particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.
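To make the rational-function claim concrete, here is a minimal sketch on a fully observable two-state chain. It is not a PSR and not the paper's construction. The transition probabilities, the rewards, and the names `P`, `R`, and `average_reward` are all hypothetical, chosen only for illustration. With a single policy parameter `theta` (the probability of choosing action `a`), the mixed transition probabilities are linear in `theta`, so the stationary distribution and the average reward come out as ratios of polynomials in `theta`.

```python
from fractions import Fraction

# Hypothetical two-state, two-action chain; every number here is illustrative.
# P[action][s][s2] is the probability of moving from state s to state s2.
P = {
    "a": [[Fraction(9, 10), Fraction(1, 10)], [Fraction(1, 2), Fraction(1, 2)]],
    "b": [[Fraction(1, 5), Fraction(4, 5)], [Fraction(1, 10), Fraction(9, 10)]],
}
# R[s] is the reward received while in state s (assumed, not from the paper).
R = [Fraction(0), Fraction(1)]


def average_reward(theta):
    """Average reward when action "a" is chosen with probability theta in every state."""
    theta = Fraction(theta)
    # Off-diagonal transition probabilities of the policy-mixed chain (linear in theta).
    p01 = theta * P["a"][0][1] + (1 - theta) * P["b"][0][1]
    p10 = theta * P["a"][1][0] + (1 - theta) * P["b"][1][0]
    # Closed-form stationary distribution of a two-state chain.
    pi0 = p10 / (p01 + p10)
    pi1 = p01 / (p01 + p10)
    # Stationary average of the state rewards: a ratio of polynomials in theta.
    return pi0 * R[0] + pi1 * R[1]


if __name__ == "__main__":
    for t in (Fraction(0), Fraction(1, 2), Fraction(1)):
        print(t, average_reward(t))
```

For these illustrative numbers the average reward simplifies to (8 − 7θ)/(9 − 3θ), a ratio of two degree-one polynomials in the policy parameter.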
Related resources
Average Reward Optimization Objective In Partially Observable Domains - Supplementary Material
Figure 3. The first two plots describe the behavior of the system which rotates the state by a → 10° and b → −10°, with the reset states aligned in a way that taking the opposite action brings the system to its topmost point. The x and y in the plots represent possible actions such that x = y. The bottom plot demonstrates how the average reward changes as a function of α and β, where the pol...
Analyzing and Escaping Local Optima in Planning as Inference for Partially Observable Domains
Planning as inference recently emerged as a versatile approach to decision-theoretic planning and reinforcement learning for single and multi-agent systems in fully and partially observable domains with discrete and continuous variables. Since planning as inference essentially tackles a non-convex optimization problem when the states are partially observable, there is a need to develop techniqu...
Escaping local optima in POMDP planning as inference
Planning as inference recently emerged as a versatile approach to decision-theoretic planning and reinforcement learning for single and multi-agent systems in fully and partially observable domains with discrete and continuous variables. Since planning as inference essentially tackles a non-convex optimization problem when the states are partially observable, there is a need to develop techniqu...
Expectation Maximization for Average Reward Decentralized POMDPs
Planning for multiple agents under uncertainty is often based on decentralized partially observable Markov decision processes (Dec-POMDPs), but current methods must de-emphasize long-term effects of actions by a discount factor. In tasks like wireless networking, agents are evaluated by average performance over time, both short- and long-term effects of actions are crucial, and discounting based s...
Efficient Planning for Factored Infinite-Horizon DEC-POMDPs
Decentralized partially observable Markov decision processes (DEC-POMDPs) are used to plan policies for multiple agents that must maximize a joint reward function but do not communicate with each other. The agents act under uncertainty about each other and the environment. This planning task arises in optimization of wireless networks, and other scenarios where communication between agents is r...
Publication date: 2013